Please, read my outlines for the data visualization course in the the follow README file (here) and delves into learning about data visualization.
When I started coding for biology I realize on this amazing challenge about how to tell the history from a bunch of Next Generation sequencing datasets. Visualization of information (from massive data mining in special) become in a nice part of my data scientist training.
We will use high throughput sequencing dataset from a marine non model organism exposed to hidrocarbon polutant. The libraries sequenced were acqured from undifferented (Un) and sexual differented stages ( Male and female). In this experiment time 0 and three corresponded to hidrocarbon pollutant expossition (before and after, respectivily).
Libraries were preprocessing using the standar parameters within trimmomatic and the assembly were performed with trinity. Differential gene expression and annotation were perform with edgeR (an R package) and Trinotate, respectivily.
The chunk-code here correspond to my workflow used in the lab. Hope you enjoy it!
From your work directory clone the follow repo git clone https://github.com/RJEGR/July_2018_bioinfo.git. Example:
mkdir July_2018_bioinfo
cd July_2018_bioinfo
git clone https://github.com/RJEGR/July_2018_bioinfo.git
# library(devtools)
# devtools::install_github("rlbarter/superheat")
# devtools::install_github("slowkow/ggrepel")
# install_github("cstubben/trinotateR")
# devtools::install_github("ropensci/taxa")
# devtools::install_github("grunwaldlab/metacoder")
# if (!require("DT")) install.packages('DT')
# source("https://bioconductor.org/biocLite.R"); biocLite("org.Hs.eg.db"); (~ 74.3 MB)
# biocLite("ggtree")
dir <- c("/Users/cigom/Documents/GitHub/July_2018_bioinfo/infile/")
# setwd(dir)
x <- read.table(paste0(dir,"diffExpr.P1e-2_C1.matrix_control"))
And read the input file; then let’s make a head(x) from the file:
| T0F3 | T0F4 | T0M1 | T0M4 | T0UR | T0U | |
|---|---|---|---|---|---|---|
| TRINITY_DN29981_c0_g1_i1 | 158.162 | 152.974 | 0.537 | 0.000 | 0.000 | 0.000 |
| TRINITY_DN25178_c2_g1_i1 | 0.337 | 0.181 | 167.292 | 100.095 | 140.610 | 925.271 |
| TRINITY_DN23469_c2_g2_i1 | 0.000 | 0.000 | 24.217 | 0.000 | 312.915 | 3789.992 |
| TRINITY_DN27384_c1_g1_i1 | 54.064 | 150.841 | 0.000 | 0.000 | 0.000 | 0.095 |
| TRINITY_DN32136_c1_g1_i2 | 0.000 | 0.000 | 34.300 | 3.078 | 48.656 | 93.654 |
| TRINITY_DN24700_c1_g1_i3 | 0.000 | 0.000 | 2.190 | 0.000 | 50.995 | 2477.124 |
Let’s log2 transform the data. Why to use the log2 transformation of the normalized count table ? (please read an answer on a researchgate question here)
data = x # restore before doing various data transformations
data = log2(data+1)
data = as.matrix(data) # convert to matrix
data = t(scale(t(data), scale=F)) # Centering rows
Using superheat
library(superheat)
superheat(data,
# retain original order of rows/cols
pretty.order.rows = TRUE,
pretty.order.cols = TRUE,
row.dendrogram = TRUE,
left.label = "none",
bottom.label.text.angle = 0,
row.title = "Differential Expressed",
column.title = "Samples")
example 1
After process the differential gene expression analysis (Ex. running the run_DE_analysis.pl from Trinity framework) we can improve the data visualization as follow:
DE <- read.table(paste0(dir,"RSEM.isoform.counts.matrix.Female_vs_Undiff.edgeR.DE_results"))
library(ggplot2)
library(scales)
p <- ggplot(DE, aes(x=logFC, y=-log10(PValue))) + geom_point()
p
This is a Volcano plot
Let’s label by Fold Change (up/down) rate and significancy by the follow condition:
# logf>abs(2) fdr < 0.05 fdr < 0.05 and logfc> abs(2)
DE$Sig <- "Non Sig or basal"
DE$Sig[(abs(DE$logFC) > 2)] <- "logFC"
DE$Sig[(DE$FDR<0.05) & (abs(DE$logFC)>2)] <- "logFC_FDR"
p <- ggplot(DE, aes(x=logFC, y=-log10(PValue))) +
geom_point(aes(color = Sig)) + theme_classic() +
scale_color_brewer()
p
Volcano; Color and fill
Add lines and axis name:
p1 <- p +
geom_vline(xintercept = 0) +
geom_hline(yintercept = 0) +
geom_hline(yintercept = -log10(0.0001), linetype = "dashed") +
geom_vline(xintercept = c(-2, 2), linetype = "dashed")
And also let’s rename x and y axis using backquote macros:
p1 + xlab(bquote(~Log[2]~ "fold change")) + ylab(bquote(~-Log[10]~italic(P)))
Volcano lines and axis labeled
# p2 <- p + xlab("Fold change (log2)") + ylab("P-Value")
Finally, let’s label the dots from the scatter:
library(ggrepel)
maxlab <- max(-log10(DE$PValue)) - 1 # select the points below the highest -log10(PValue) value to label
p2 <- p1 + geom_text_repel(
data = subset(DE, -log10(PValue) > maxlab),
aes(label = Sig),
size = 2.5,
box.padding = unit(0.2, "lines"),
point.padding = unit(0.2, "lines")
)
Finally, let’s compare the layouts:
library(gridExtra)
library(grid)
grid_arrange_shared_legend(p,p1,p2, ncol = 3)
Volcano comparison
After assembly denovo or guided transcriptome is common to map each contig to a reference in order to annotate the potential source of each transcript. In this view, blast2go is the average tool used by users to perform this analysis nevertheles blast2go is a non free tool. In spite of, Trinotate is a useful free-framework to this purpose. Trinotate makes use of a number of different well referenced methods for functional annotation including homology search to known sequence data (BLAST+/SwissProt), protein domain identification (HMMER/PFAM), protein signal peptide and transmembrane domain prediction (signalP/tmHMM), and leveraging various annotation databases (eggNOG/GO/Kegg databases).
In personal experience, trinotate had resulted in better performance (such less computational time) than the blast2go counterpart and useful to automatically annotate one workflow at time.
In addition, trinotateR is package developed by Chris Stubben with useful functions to “wrangling” the tab-delimited Trinotate.xls result.
library(trinotateR)
## Loading required package: data.table
y <- read_trinotate(paste0(dir,"Trinotate.xls"))
knitr::kable(summary_trinotate(y))
| unique | total | |
|---|---|---|
| gene_id | 61694 | 148534 |
| transcript_id | 147454 | 148534 |
| prot_id | 55785 | 55785 |
| prot_coords | 24521 | 55785 |
| sprot_Top_BLASTX_hit | 43800 | 45692 |
| gene_ontology_blast | 14000 | 44709 |
| Kegg | 15112 | 39430 |
| eggnog | 5651 | 37535 |
| sprot_Top_BLASTP_hit | 28519 | 35660 |
| Pfam | 26217 | 34005 |
| gene_ontology_pfam | 1775 | 20111 |
| TmHMM | 7764 | 10393 |
| RNAMMER | 9 | 9 |
| SignalP | 0 | 0 |
| transcript | 0 | 0 |
| peptide | 0 | 0 |
Most of the annotations contain mutliple hits in a backtick-delimited list and each hit contains multiple fields in a caret-delimited list. For example, the second Pfam annotation below contains two hits and each hit contains a pfam id, symbol, name, alignment and e-value. The split_pfam functions splits multiple hits and fields, so the second Pfam annotation is now printed in rows 2 and 3 below.
y1 <- split_pfam(y)
## 70856 Pfam annotations
head(y1,3)[,c(2,4:7)]
## transcript pfam symbol
## 1: TRINITY_DN10004_c0_g1_i1 PF01442 Apolipoprotein
## 2: TRINITY_DN10005_c0_g1_i1 PF16489 GAIN
## 3: TRINITY_DN10007_c0_g1_i1 PF00135 COesterase
## name align
## 1: Apolipoprotein A1/A4/E domain 20-67
## 2: GPCR-Autoproteolysis INducing (GAIN) domain 17-121
## 3: Carboxylesterase family 3-85
Finally, the summary_pfam lists both, the number of unique Pfam identifiers and the total genes, transcripts and proteins with a Pfam annotation.
y2 <- summary_pfam(y1)
## 4849 rows
knitr::kable(head(y2[order(-y2$transcripts),]))
| pfam | symbol | name | genes | transcripts | proteins | total |
|---|---|---|---|---|---|---|
| PF00386 | C1q | C1q domain | 201 | 801 | 809 | 856 |
| PF00069 | Pkinase | Protein kinase domain | 279 | 594 | 594 | 611 |
| PF07714 | Pkinase_Tyr | Protein tyrosine kinase | 272 | 583 | 583 | 596 |
| PF00400 | WD40 | WD domain, G-beta repeat | 203 | 461 | 461 | 1351 |
| PF12796 | Ank_2 | Ankyrin repeats (3 copies) | 219 | 455 | 455 | 925 |
| PF00023 | Ank | Ankyrin repeat | 214 | 449 | 449 | 1031 |
The summary table also includes a count attribute with the number of unique genes, transcripts and proteins with a Pfam annotation, as well as the total number of annotations. In this example, there are 33,721 unique transcripts with a Pfam annotation and 56,642 total annotations to transcripts (since those may have more than one Pfam annotation).
c <- attr(y2, "count")
print(c)
## unique annotations
## Pfam 4849 70856
## genes 14417 25280
## transcripts 33721 56642
## proteins 34005 56757
Implement datatable to present interactive data-table: from all the available genes/transcripts annotations.
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
trinotateR to get the Gene Ontology annotation.A GO annotation is a statement about the function of a particular gene. Each GO annotation consists of an association between a gene and a GO term. Together, these statements comprise a “snapshot” of current biological knowledge. The GO describes function with respect to three aspects: molecular function (molecular-level activities performed by gene products), cellular component (the locations relative to cellular structures in which a gene product performs a function), and biological process (the larger processes, or ‘biological programs’ accomplished by multiple molecular activities)
Reference: http://www.geneontology.org/page/ontology-documentation, 2018
go <- split_GO(y)
## 473221 gene_ontology_blast annotations
gos <- summary_GO(go)
## 15386 rows
head(gos[order(-gos$transcripts),])
## go ontology name genes
## 1: GO:0005737 cellular_component cytoplasm 5087
## 2: GO:0005634 cellular_component nucleus 4286
## 3: GO:0016021 cellular_component integral component of membrane 3839
## 4: GO:0005829 cellular_component cytosol 3508
## 5: GO:0005886 cellular_component plasma membrane 3366
## 6: GO:0046872 molecular_function metal ion binding 2771
## transcripts proteins total
## 1: 10521 8155 10550
## 2: 8687 6947 8725
## 3: 7817 6020 7832
## 4: 7300 5854 7325
## 5: 6305 4961 6315
## 6: 5815 4421 5829
And also determine the counts
cgo <- attr(gos, "count")
print(cgo)
## unique annotations
## GO 15386 473221
## genes 19818 241902
## transcripts 44415 472268
## proteins 34463 379068
Using only the translated transcripts to get the differential Expressed annotations:
got <- na.omit(go, cols = "protein")
x$transcript <- rownames(x)
m <- merge(got, x, suffix = c("transcript"), all=FALSE)
m <- m[order(m$transcript),c(1,4,5:12)]
library(DT)
datatable(m , escape=1, options = list( pageLength = 10 ) )
Let’s draw some visualizations from the annotation enrichment throught a the transcriptome assembly using the package ggpubr:
library(ggpubr)
## Loading required package: magrittr
plotgos <- head(gos[order(-gos$transcripts),], 80)
ggbarplot(plotgos, "name", "transcripts",
fill = "ontology",
color = "ontology",
palette = c("#00AFBB", "#E7B800", "#FC4E07")) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 7))
Gene Ontology enrichment plot (top 80 transcripts)
# facet_grid(ontology ~ .,) + theme(#strip.text.x = element_blank(),
# axis.text.x = element_text(angle = 90, hjust = 1))
And separate the ontology annotation for further analysis:
| Var1 | Freq |
|---|---|
| biological_process | 10236 |
| cellular_component | 1571 |
| molecular_function | 3579 |
You can separete the go terms to perfom further test:
MF <- gos[gos$ontology=="molecular_function",]
CC <- gos[gos$ontology=="cellular_component",]
BP <- gos[gos$ontology=="biological_process",]
Determine the similarity of two GO terms based on the annotation statistics of their common ancestor terms by computing semantic similarity among GO terms, sets of GO terms, gene products, and gene clusters, providing different methods than measure the information content (IC). To details please read the Wang’s paper published in Oxford.
First, build annotation data needed by GOSemSim via godata function. Based in figure from Gene Ontology Enrichment, we could focus on the Cellular component (CC) terms.
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind,
## colMeans, colnames, colSums, dirname, do.call, duplicated,
## eval, evalq, Filter, Find, get, grep, grepl, intersect,
## is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
## paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
## Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which, which.max,
## which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:data.table':
##
## first, second
## The following object is masked from 'package:base':
##
## expand.grid
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:data.table':
##
## shift
##
hsGO <- GOSemSim::godata('org.Hs.eg.db', ont="CC", computeIC=FALSE)
##
## preparing gene to GO mapping data...
go <- as.vector(CC$go)
go1 <- sample(go, 20)
go2 <- sample(go, 20)
gosim <- GOSemSim::mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)
And visualize the similarity of the GO term.
superheat(gosim,
# retain original order of rows/cols
pretty.order.rows = TRUE,
pretty.order.cols = TRUE,
#left.label = "none",
bottom.label.text.angle = 90,
row.title = "Sample 1",
column.title = "Sample 2",
bottom.label.text.size = 4,
left.label.text.size = 4
)
Finally detache (unload) org.Hs.eg.db package.
detach(package:org.Hs.eg.db, unload = TRUE)
From the annotation, lets use the blastx result in order to get the lineage from proteins anntotated.
split_blast2 <- function (x, hit = "sprot_Top_BLASTX_hit")
{
y <- x[!is.na(get(hit)), .(get(hit), gene_id, transcript_id,
prot_id)]
z <- strsplit(y$V1, "`")
n <- sapply(z, length)
z <- strsplit(unlist(z), "\\^")
if (any(sapply(z, "[", 1) != sapply(z, "[", 2)))
print("WARNING: check different values in columns 1 and 2")
NAME <- gsub("^RecName: Full=", "", sapply(z, "[", 6))
NAME <- gsub("SubName: Full=", "", NAME)
NAME <- gsub(";$", "", NAME)
NAME <- gsub(" \\{[^}]+}", "", NAME)
x1 <- data.frame(gene = rep(y$gene_id, n), transcript = rep(y$transcript_id,
n), protein = rep(gsub(".*\\|", "", y$prot_id), n), uniprot = sapply(z,
"[", 1), align = sapply(z, "[", 3), identity = as.numeric(gsub("%ID",
"", sapply(z, "[", 4))), evalue = as.numeric(gsub("E:",
"", sapply(z, "[", 5))), name = NAME, lineage = sapply(z,
"[", 7), domain = gsub("; .*", "", sapply(z, "[", 7)),
genus = gsub(".*; ", "", sapply(z, "[", 7)), stringsAsFactors = FALSE)
message(nrow(x1), " ", hit, " annotations")
data.table(x1)
}
blast <- split_blast2(y)
## 63540 sprot_Top_BLASTX_hit annotations
blast2 <- summary_blast(blast)
## 17039 rows
knitr::kable(head(blast2[order(-blast2$transcripts),]))
| uniprot | domain | genus | name | genes | transcripts | proteins | total |
|---|---|---|---|---|---|---|---|
| HS12B_HUMAN | Eukaryota | Homo | Heat shock 70 kDa protein 12B | 60 | 136 | 121 | 146 |
| C1QL4_MOUSE | Eukaryota | Mus | Complement C1q-like protein 4 | 47 | 122 | 106 | 124 |
| HS12A_MOUSE | Eukaryota | Mus | Heat shock 70 kDa protein 12A | 51 | 113 | 90 | 116 |
| HS12A_HUMAN | Eukaryota | Homo | Heat shock 70 kDa protein 12A | 53 | 99 | 64 | 99 |
| PLCL_MYTGA | Eukaryota | Mytilus | Perlucin-like protein | 38 | 98 | 69 | 98 |
| PLC_HALLA | Eukaryota | Haliotis | Perlucin | 34 | 97 | 81 | 100 |
Creates nice looking color palettes (discrete).
library(RColorBrewer)
# display.brewer.all()
display.brewer.pal(n = 5, name = 'Set3') # Hexadecimal color specification
Could we color with orange discrete scale the
gosimobject usingsuperheat?. superheat could perform this by using the optionheat.paloption.
## Warning in brewer.pal(n = 10, name = "Oranges"): n too large, allowed maximum for palette Oranges is 9
## Returning the palette you asked for with that many colors
Visualizing hierarchical information ### Metacoder Metacoder.
Let’s subset the virus lineage annotation:
knitr::kable(table(blast$domain))
| Var1 | Freq |
|---|---|
| . | 16 |
| Archaea | 137 |
| Bacteria | 1245 |
| Eukaryota | 61817 |
| Viruses | 325 |
vs <- blast[blast$domain == "Viruses",]
list <- vs$lineage
And plot …
library(metacoder)
## Loading required package: taxa
##
## Attaching package: 'taxa'
## The following object is masked from 'package:ggplot2':
##
## map_data
## This is metacoder verison 0.2.1.9008 (development version). If you use metacoder for published research, please cite our paper:
##
## Foster Z, Sharpton T and Grunwald N (2017). "Metacoder: An R package for visualization and manipulation of community taxonomic diversity data." PLOS Computational Biology, 13(2), pp. 1-15. doi: 10.1371/journal.pcbi.1005404
##
## Enter `citation("metacoder")` for a BibTeX entry for this citation.
##
## Attaching package: 'metacoder'
## The following object is masked from 'package:IRanges':
##
## reverse
library(RColorBrewer)
obj <- parse_tax_data(vs, class_cols = "lineage", class_sep = ";")
heat_tree(obj,
node_label = taxon_names,
node_size = n_obs,
node_color = n_obs,
#edge_color = "grey",
node_color_range = brewer.pal(n = 10, name = "Oranges"))
## Warning in brewer.pal(n = 10, name = "Oranges"): n too large, allowed maximum for palette Oranges is 9
## Returning the palette you asked for with that many colors
Tree example using metacoder
Which annotation could be true, let’s show the identity ?
heat_tree(obj,
edge_label = taxon_names,
edge_label_size = 0.1,
node_color = identity,
edge_color = "grey",
node_color_range = brewer.pal(n = 10, name = "Oranges"),
node_size_axis_label = "n_obs",
node_color_axis_label = "Identity")
## Warning in brewer.pal(n = 10, name = "Oranges"): n too large, allowed maximum for palette Oranges is 9
## Returning the palette you asked for with that many colors
## Warning: NAs found in the "node_color" option. These may cause plotting errors or other strange behavior. NAs can be created if there is not a value for each taxon. The following 28 of 48 taxa have NAs for the "node_color" option:
## ab, ac, ad, ae, af, ag, ah, ai ... ba, bb, bh, bj, bm, bn, bo
Tree example 2; labeling by identity
The ggtree is designed by extending the ggplot2 (Wickham 2009) package. It is based on the grammar of graphics and takes all the good parts of ggplot2. ggtree supports several layouts, including rectangular, slanted, circular and fan for phylogram and cladogram, equal_angle and daylight for unrooted layout, time-scaled and two dimentional phylogenies.
Reference: Yu et al. 2017, https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.12628
tbl <- head(vs[order(-vs$identity),], 50)
tbl <- tbl[!duplicated(tbl$transcript),] # remove duplicates annotations
write.table(tbl$transcript, file = paste0(dir, "virus.list"), quote = FALSE, col.names = FALSE, row.names = FALSE)
Run in clustalo web-service and input again in r the tree this is a Neighbour-joining tree without distance corrections.
tree <- treeio::read.newick(paste0(dir,"virus.tree.corrected.txt"))
And plot
library(ggtree)
## ggtree v1.12.0 For help: https://guangchuangyu.github.io/software/ggtree
##
## If you use ggtree in published research, please cite:
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 2017, 8(1):28-36, doi:10.1111/2041-210X.12628
##
## Attaching package: 'ggtree'
## The following object is masked from 'package:IRanges':
##
## collapse
## The following object is masked from 'package:S4Vectors':
##
## expand
## The following object is masked from 'package:ggpubr':
##
## rotate
## The following object is masked from 'package:magrittr':
##
## inset
multiplot(
ggtree(tree, branch.length = "none") + geom_treescale(width=0.4),
ggtree(tree, branch.length='none', layout="daylight") + geom_treescale(width=0.4) + geom_nodepoint(),
ggtree(tree, branch.length='none', layout='circular') + geom_treescale(width=0.4),
ncol=3,labels = LETTERS[1:3])
## Average angle change [1] 0.099133807614635
## Average angle change [2] 0.0415093787405126
ggtree; using different layouts
Then,
p <- ggtree(tree, branch.length='none', layout='circular') + geom_treescale(width=0.4) +
geom_nodepoint(color="#b5e521", alpha=1/4, size=4)
Modify tips and node geometry
multiplot(# add node points
p + geom_nodepoint(),
# add tip points
p + geom_tippoint(),
# Label the tips
p + geom_tiplab(),
ncol=3,labels = LETTERS[1:3])
ggtree; using different layouts
Use annotation info to merge within object (ie. tree and dataframe)
d <- p + geom_nodepoint()
tbl <- tbl[,-c(1)]
colnames(tbl)[1] <- "label"
tbl$genus <- gsub("unclassified ", "", tbl$genus)
t1 <- d %<+% tbl +
geom_point(aes(color=genus), size=5, alpha=.5, na.rm = TRUE) + theme(legend.position="bottom")
t2 <- p %<+% tbl +
geom_tiplab(size=2.5, aes(label=paste0('italic(', uniprot, ')')), parse=TRUE)
multiplot(t1, t2, ncol=2)
ggtree; labeling tree based on data-frame
See the multiqc_report.html in the infile folder. If you’re interested in do this in-lab-plots, please visit the brief pre-processing step documentation.